Error Rate Estimate for Cluster Data – Application to Automatic Spoken Language Identification
Abstract
If the dataset available for machine learning results from cluster sampling, the usual cross-validation error rate estimate can be biased and misleading. An adapted cross-validation is described for this case. Using a simulation, the sampling distributions of the generalization error rate estimate under the cluster sampling and simple random sampling hypotheses are compared with the true value. The results highlight the impact of the sampling design on inference: clustering has a significant effect, and the split between learning set and test set should result from a random partition of the clusters, not from a random partition of the examples. The results are confirmed on a real application of automatic spoken language identification.
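The cluster-level partition recommended in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cluster_kfold`, the fold count, and the toy data (speakers standing in for clusters) are all assumptions made for illustration. The idea is that every example from a given cluster lands on the same side of the learning/test split.

```python
import random
from collections import defaultdict


def cluster_kfold(cluster_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which whole clusters are held out.

    cluster_ids[i] is the cluster label of example i. The name and
    signature are hypothetical, chosen only for this illustration.
    """
    # Group example indices by the cluster they belong to.
    by_cluster = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        by_cluster[c].append(i)

    clusters = sorted(by_cluster)
    if n_folds > len(clusters):
        raise ValueError("n_folds cannot exceed the number of clusters")

    # Randomly partition the clusters (not the examples) into folds.
    random.Random(seed).shuffle(clusters)
    folds = [clusters[k::n_folds] for k in range(n_folds)]

    for held_out in folds:
        held = set(held_out)
        test_idx = [i for c in held_out for i in by_cluster[c]]
        train_idx = [i for c in clusters if c not in held for i in by_cluster[c]]
        yield train_idx, test_idx


# Assumed toy data: 6 speakers (clusters) with 3 utterances each.
speakers = [s for s in range(6) for _ in range(3)]
splits = list(cluster_kfold(speakers, n_folds=3, seed=42))
```

A partition of individual examples would instead shuffle the indices directly, so utterances from the same speaker could appear in both the learning set and the test set. That is the situation the abstract warns leads to misleading error rate estimates.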
Related resources
Accuracy Estimation With Clustered Dataset
If the dataset available to machine learning results from cluster sampling (e.g. patients from a sample of hospital wards), the usual cross-validation error rate estimate can lead to biased and misleading results. An adapted cross-validation is described for this case. Using a simulation, the sampling distribution of the generalization error rate estimate, under cluster or simple random samplin...
Automatic Language Identification with Discriminative Language Characterization Based on SVM
Robust automatic language identification (LID) is the task of identifying the language from a short utterance spoken by an unknown speaker. The mainstream approaches include parallel phone recognition language modeling (PPRLM), support vector machine (SVM) and the general Gaussian mixture models (GMMs). These systems map the cepstral features of spoken utterances into high level scores by class...
Unsupervised adaptation for acoustic language identification
Our system for automatic language identification (LID) of spoken utterances uses language dependent parallel phoneme recognition (PPR) with Hidden Markov Model (HMM) phoneme recognizers and optional phoneme language models (LMs). Such a LID system for continuous speech requires many hours of orthographically transcribed data for training of language dependent HMMs and LMs as well ...
A Real-Time Spoken-Language System for Interactive Problem Solving
SRI has developed a spoken language system to retrieve air travel planning information. Progress can be measured by comparing DARPA benchmark results in February 1992 and November 1992. Between February 1992 and November 1992, for all utterances tested, SRI's word error rate in the ATIS speech recognition test improved from 11.0% to 9.1%. Weighted utterance error improved from 31.1% to 23.6% in...
Language model adaptation for conversational speech recognition using automatically tagged pseudo-morphological classes
Statistical language models provide a powerful tool to model natural spoken language. Nevertheless, a large set of training sentences is required to reliably estimate the model parameters. In this paper we present a method to estimate n-gram probabilities from sparse data. The proposed language modeling strategy allows a generic language model (LM) to be adapted to a new semantic domain with just ...